-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prometheus stats: Correctly group lines of the same metric name. #10833
Conversation
Fixes envoyproxy#10073 Signed-off-by: Greg Greenway <ggreenway@apple.com>
Question for anyone with knowledge of the exposition format (https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md#grouping-and-sorting): The format says How does this apply to histograms? Is a single histogram, which is composed of several
Or
|
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
source/server/http/stats_handler.cc
Outdated
auto generate_counter_and_gauge_output = | ||
[](const auto& metric) -> std::pair<std::string, std::string> { | ||
const std::string tags = formattedTags(metric.tags()); | ||
const std::string metric_name = metricName(metric.tagExtractedName()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm worried about the explosion of string-storage during the sort. Can we populate a vector of metrics, and sort them by Metric::tagExtractedStatName()? StatNames are sortable.
Then we could stream out the formatted value?
I realize that at the moment the admin output infrastructure buffers everything, so I'm not under the delusion we can in this PR have the prometheus handler consume constant space. But I think we could make it consume half the space it consumes with this approach, and that's probably a significant memory bump. And I think it wouldn't be hugely difficult to update the admin output infrastructure to stream rather than fully buffer.
There was a talk at KubeCon by Splunk about stats cardinality blowing through hundreds of gigs of memory in Prometheus (slides) and the source of the stats explosion was Istio.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also to sort by StatName you can use:
envoy/source/common/stats/symbol_table_impl.h
Line 697 in 62777e8
struct StatNameLessThan { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worried about this too. Let me take a pass at this and see how it comes out.
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
@jmarantz I think I've fixed the memory usage. PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pretty cool! just a couple of minor suggestions.
@rulex123 FYI as you are messing around in the same code though I don't see a conflict. Note that with this PR we move closer to a point where we don't need to hold the entire copy of the stringified stats output in memory during an admin request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great generally, just nit-picking style now.
include/envoy/stats/refcount_ptr.h
Outdated
@@ -25,6 +25,8 @@ namespace Stats { | |||
// API in years. | |||
template <class T> class RefcountPtr { | |||
public: | |||
using element_type = T; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc this? And also consider spelling it ElementType per Envoy style?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This definition is taken from std::shared_ptr. I'll doc it.
source/server/http/stats_handler.cc
Outdated
* @param generate_output A std::function<std::string(const MetricType& metric, const std::string& | ||
* prefixedTagExtractedName)> which returns the output text for this metric. | ||
*/ | ||
auto process_type = [&response, ®ex, used_only](const auto& metrics, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering whether you could use templates in some way to reduce the use of auto
-- the auto
for the generated function is OK but it'd be nicer to be explicit about what type metrics and generate_output were.
See https://google.github.io/styleguide/cppguide.html#Type_deduction for discussion.
I think my concrete suggestion is to break up most of this code into a private helper method processType
which is templatized. Then you could have the body of statsAsPrometheus() simply be:
uint64_t metric_name_count = 0;
metric_name_count += processType<Stats::Counter>(counters, generate_counter_and_gauge_output, "counter");
metric_name_count += processType<Stats::Gauge>(gauges, generate_counter_and_gauge_output, "gauge");
metric_name_count += processType<Stats::Histogram>(histograms, generate_histogram_output, "histogram");
I think that might be clearer than using the lambda with the auto...WDYT?
That would also remove the need for the element_type
in the refcount, at least for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both metrics
and generate_output
are templatized types because of how Counter
, Gauge
, and Histogram
are defined. There isn't a common/inherited defintion for the functions that get the data. I had tried to make these non-auto, and it isn't possible.
I can make the helpers into template methods instead of a lambda, and I agree that may make this a little easier to follow. Some of the types will become more concrete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think once you make the helper into an explicit template method you'll be able to drop the auto
, though I could be wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it came out better than I expected, although I had to explicitly specify template parameters in some cases that were unexpected to me. But I think it is more readable.
/wait |
Signed-off-by: Greg Greenway <ggreenway@apple.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great...thank you!
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Haven't seen this error before:
/retest |
🔨 rebuilding |
Signed-off-by: Spencer Lewis <slewis@squareup.com> * master: fault injection: add support for setting gRPC status (envoyproxy#10841) tests: tag tests that fail on Windows with fails_on_windows (envoyproxy#10940) Fix typo on Postgres Proxy documentation. (envoyproxy#10930) fuzz: improve header/data stop/continue modeling in HCM fuzzer. (envoyproxy#10931) gzip filter: allow setting zlib compressor's chunk size (envoyproxy#10508) http: replace vector/reserve with InlinedVector in codec helper (envoyproxy#10941) stats: add utilities to create stats from a vector of tokens, mixing dynamic and symbolic elements. (envoyproxy#10735) hcm: avoid invoking 100-continue handling on decode filter. (envoyproxy#10929) prometheus stats: Correctly group lines of the same metric name. (envoyproxy#10833) status: Fix ASAN error in Status payload handling (envoyproxy#10906) path: Fix merge slash for paths ending with slash and present query args (envoyproxy#10922) compressor filter: add benchmark (envoyproxy#10464) xray: expected_span_name is not captured by the lambda with MSVC (envoyproxy#10934) ci: update before purge in cleanup (envoyproxy#10938) tracer: Improve test coverage for x-ray (envoyproxy#10890) Revert "init: order dynamic resource initialization to make RTDS always be first (envoyproxy#10362)" (envoyproxy#10919)
Signed-off-by: Greg Greenway ggreenway@apple.com
Description: The prometheus exposition format requires all metrics of the same name (without tags) be grouped contiguously in the output. Additionally, it specifies that it is preferred for the stats to be output in the same order every time they are produced. Fix Envoy to comply with both of these constraints by taking an extra pass to collect all the stats so that they can be sorted before the final output is generated.
Risk Level: Low
Testing: New unit test
Docs Changes: Not needed
Release Notes: Added
Fixes #10073